The Electric Vehicle Population Data dataset contains information on the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) registered through the Washington State Department of Licensing (DOL). The Department of Licensing owns the dataset, so the data is primary and internal. The dataset is open under the Open Database License (ODbL).
The data consists of 17 columns describing some characteristics of the electric vehicles registered as of March 21, 2025.
Objective: Understand which characteristics are registered most frequently in the dataset.
Particular objective: Learn to work with missing and zero values as well as with outliers.
To analyse data, the following questions arise:
## # A tibble: 5 × 17
## `VIN (1-10)` County City State `Postal Code` `Model Year` Make Model
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 5YJ3E1EBXK King Seattle WA 98178 2019 TESLA MODEL 3
## 2 5YJYGDEE3L Kitsap Poulsbo WA 98370 2020 TESLA MODEL Y
## 3 KM8KRDAF5P Kitsap Olalla WA 98359 2023 HYUNDAI IONIQ 5
## 4 5UXTA6C0XM Kitsap Seabeck WA 98380 2021 BMW X5
## 5 JTMAB3FV7P Thurston Rainier WA 98576 2023 TOYOTA RAV4 P…
## # ℹ 9 more variables: `Electric Vehicle Type` <chr>,
## # `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
## # `Electric Range` <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
## # `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
## # `2020 Census Tract` <chr>
| Name | Piped data |
|---|---|
| Number of rows | 235692 |
| Number of columns | 17 |
| Column type frequency: character | 11 |
| Column type frequency: numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | n_unique | empty | whitespace |
|---|---|---|---|---|
| VIN (1-10) | 0 | 13763 | 0 | 0 |
| County | 3 | 212 | 0 | 0 |
| City | 3 | 788 | 0 | 0 |
| State | 0 | 48 | 0 | 0 |
| Make | 0 | 46 | 0 | 0 |
| Model | 0 | 171 | 0 | 0 |
| Electric Vehicle Type | 0 | 2 | 0 | 0 |
| Clean Alternative Fuel Vehicle (CAFV) Eligibility | 0 | 3 | 0 | 0 |
| Vehicle Location | 10 | 957 | 0 | 0 |
| Electric Utility | 3 | 76 | 0 | 0 |
| 2020 Census Tract | 3 | 2204 | 0 | 0 |
Variable type: numeric
| skim_variable | n_missing |
|---|---|
| Postal Code | 3 |
| Model Year | 0 |
| Electric Range | 36 |
| Base MSRP | 36 |
| Legislative District | 494 |
| DOL Vehicle ID | 0 |
Summary statistics for model year and electric range
## Electric Range Model Year
## Min. : 0.00 Min. :2000
## 1st Qu.: 0.00 1st Qu.:2020
## Median : 0.00 Median :2023
## Mean : 46.26 Mean :2021
## 3rd Qu.: 38.00 3rd Qu.:2024
## Max. :337.00 Max. :2025
## NA's :36
Next, several column names have spaces and use capital letters. I prefer to work with variable names in lower case with underscores in the gaps.
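A minimal base R sketch of this renaming step. The helper `to_snake_case` is a hypothetical name introduced here for illustration, and the toy data frame only mimics a few of the original headers; the text does not say which tool the analysis actually used.

```r
# Hypothetical helper: lower-case a name and replace every run of
# non-alphanumeric characters with a single underscore.
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # spaces, parentheses, hyphens -> "_"
  gsub("^_+|_+$", "", x)           # drop leading and trailing underscores
}

# Toy data frame (made-up values) with three of the original headers.
ev <- data.frame(
  `Model Year` = c(2019, 2020),
  `Electric Range` = c(220, 291),
  `Clean Alternative Fuel Vehicle (CAFV) Eligibility` = c("a", "b"),
  check.names = FALSE
)

names(ev) <- to_snake_case(names(ev))
print(names(ev))
```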
I choose to work with the following 7 features: “city”, “make”, “model”, “model_year”, “electric_range”, “clean_alternative_fuel_vehicle_cafv_eligibility” and “electric_vehicle_type”. For easier manipulation, I also replace the long names clean_alternative_fuel_vehicle_cafv_eligibility and electric_vehicle_type with the short ones eligibility and electric_type, respectively. In those columns I also shorten the long category labels: Eligibility unknown as battery range has not been researched becomes unknown, Clean Alternative Fuel Vehicle Eligible becomes clean, and Not eligible due to low battery range becomes negative.
Among the different features there is only a moderate correlation between eligibility and model year.
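The renaming and relabelling described above could be sketched in base R as follows. The rows are made-up toy values, and the original spelling of the electric vehicle type labels is an assumption (only the short forms battery and hybrid appear in the tables), so the sketch matches them by prefix.

```r
# Toy rows (made-up values) using the cleaned column names.
ev <- data.frame(
  city = c("Seattle", "Olalla", "Seabeck"),
  make = c("A", "B", "C"),
  model = c("M1", "M2", "M3"),
  model_year = c(2019, 2023, 2021),
  electric_range = c(220, 0, 30),
  clean_alternative_fuel_vehicle_cafv_eligibility = c(
    "Clean Alternative Fuel Vehicle Eligible",
    "Eligibility unknown as battery range has not been researched",
    "Not eligible due to low battery range"
  ),
  # Assumed spelling of the original type labels.
  electric_vehicle_type = c(
    "Battery Electric Vehicle (BEV)",
    "Battery Electric Vehicle (BEV)",
    "Plug-in Hybrid Electric Vehicle (PHEV)"
  ),
  stringsAsFactors = FALSE
)

# Shorten the two long column names.
names(ev)[names(ev) == "clean_alternative_fuel_vehicle_cafv_eligibility"] <- "eligibility"
names(ev)[names(ev) == "electric_vehicle_type"] <- "electric_type"

# Map the long eligibility labels to the short ones with a lookup vector.
eligibility_map <- c(
  "Clean Alternative Fuel Vehicle Eligible" = "clean",
  "Eligibility unknown as battery range has not been researched" = "unknown",
  "Not eligible due to low battery range" = "negative"
)
ev$eligibility <- unname(eligibility_map[ev$eligibility])

# Shorten the type labels by matching on the leading word.
ev$electric_type <- ifelse(grepl("^Battery", ev$electric_type), "battery", "hybrid")

print(ev[, c("eligibility", "electric_type")])
```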
In the following, the selected features are grouped according to electric type.
| Name | Piped data |
|---|---|
| Number of rows | 235692 |
| Number of columns | 7 |
| Column type frequency: character | 4 |
| Column type frequency: numeric | 2 |
| Group variables | electric_type |
Variable type: character
| skim_variable | electric_type | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| city | battery | 2 | 1 | 3 | 24 | 0 | 699 | 0 |
| city | hybrid | 1 | 1 | 3 | 24 | 0 | 531 | 0 |
| eligibility | battery | 0 | 1 | 5 | 8 | 0 | 3 | 0 |
| eligibility | hybrid | 0 | 1 | 5 | 8 | 0 | 2 | 0 |
| make | battery | 0 | 1 | 3 | 22 | 0 | 38 | 0 |
| make | hybrid | 0 | 1 | 3 | 20 | 0 | 27 | 0 |
| model | battery | 0 | 1 | 2 | 24 | 0 | 103 | 0 |
| model | hybrid | 0 | 1 | 2 | 17 | 0 | 72 | 0 |
Variable type: numeric
| skim_variable | electric_type | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| model_year | battery | 0 | 1 | 2021.64 | 2.78 | 2000 | 2021 | 2023 | 2024 | 2025 |
| model_year | hybrid | 0 | 1 | 2020.53 | 3.55 | 2010 | 2018 | 2022 | 2024 | 2025 |
| electric_range | battery | 0 | 1 | 50.16 | 93.66 | 0 | 0 | 0 | 73 | 337 |
| electric_range | hybrid | 36 | 1 | 31.27 | 14.60 | 6 | 21 | 30 | 38 | 153 |
The central tendency measures, mean and median (as p50), can be read from the table above. The mode of several attributes is given in the following table.
| electric_type | city | eligibility | model_year | make | model | electric_range |
|---|---|---|---|---|---|---|
| battery | Seattle | unknown | 2023 | TESLA | MODEL Y | 0 |
| hybrid | Seattle | clean | 2024 | TOYOTA | VOLT | 32 |
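Base R has no built-in function for the statistical mode, so a table like the one above needs a small helper. The sketch below is one possible way to compute per-group modes; `stat_mode` is a hypothetical name, the data are toy values, and ties are resolved by taking the value that appears first.

```r
# Hypothetical helper: most frequent value of a vector.
# On ties, the value that occurs first in the data wins.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data (made-up values) with the two electric types.
ev <- data.frame(
  electric_type = c("battery", "battery", "battery", "hybrid", "hybrid", "hybrid"),
  model_year = c(2023, 2023, 2021, 2024, 2024, 2022),
  electric_range = c(0, 0, 215, 32, 32, 38)
)

# Mode of each numeric column within each electric type.
modes <- aggregate(
  cbind(model_year, electric_range) ~ electric_type,
  data = ev,
  FUN = stat_mode
)
print(modes)
```

The same helper works on character columns such as city or make, because it relies only on `unique` and `match`.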
The model year column has a left-skewed distribution. For the battery electric type the mode is 2023, the same as the median, and the mean is 2022. For the plug-in hybrid type the mode is 2024, the median 2022 and the mean 2021. The skew is visible in the distribution plots, where the mean and median are represented by dashed and dotted lines respectively. Outliers are observed for the battery type.
For electric range, the plots show right-skewed distributions for both the battery and the plug-in hybrid type. The battery type, however, is far more strongly skewed because of an extremely high count at range 0. This is due to the observation described in the introduction.
For the battery type the mean is 50 and the median and mode are 0. For the plug-in hybrid type the mean is 31, the median 30 and the mode 32. In the following plot, the mean and median are represented by the dashed and dotted lines respectively.
| electric_type | max.range |
|---|---|
| battery | 337 |
| hybrid | 153 |
The table and the boxplot below show that the model years between 2000 and 2010 are intermittent; for instance, the years 2001 and 2004 are not included. The values between 2000 and 2015 are also outliers, since the number of vehicles is very small compared with the rest of the years. These outliers are also visible in the distribution plot for model year.
| model_year | n |
|---|---|
| 2025 | 11176 |
| 2024 | 49044 |
| 2023 | 59893 |
| 2022 | 28958 |
| 2021 | 20615 |
| 2020 | 12265 |
| 2019 | 10974 |
| 2018 | 14368 |
| 2017 | 8570 |
| 2016 | 5306 |
| 2015 | 4661 |
| 2014 | 3407 |
| 2013 | 4230 |
| 2012 | 1490 |
| 2011 | 680 |
| 2010 | 23 |
| 2008 | 22 |
| 2003 | 1 |
| 2002 | 2 |
| 2000 | 7 |
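The gaps in the early model years can be checked directly from the counts in the table above. The following base R sketch re-enters those counts by hand; the variable names are chosen here for illustration.

```r
# Counts per model year, copied from the table above.
model_year_counts <- data.frame(
  model_year = c(2025:2010, 2008, 2003, 2002, 2000),
  n = c(11176, 49044, 59893, 28958, 20615, 12265, 10974, 14368,
        8570, 5306, 4661, 3407, 4230, 1490, 680, 23, 22, 1, 2, 7)
)

# Years between the first and last model year that never occur.
all_years <- seq(min(model_year_counts$model_year), max(model_year_counts$model_year))
missing_years <- setdiff(all_years, model_year_counts$model_year)
print(missing_years)

# Number of vehicles with a model year before 2011.
n_before_2011 <- sum(model_year_counts$n[model_year_counts$model_year < 2011])
print(n_before_2011)
```

With these counts, the missing years are 2001, 2004 to 2007 and 2009, and only 55 of the 235692 vehicles have a model year before 2011.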
The boxplots show several outliers for the Battery Electric Vehicle type and a few outliers for the Plug-in Hybrid Electric Vehicle type.
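To see where the boxplot outliers of electric range begin, the fences can be worked out from the quartiles in the grouped summary table. This sketch assumes the boxplots use the common 1.5 × IQR whisker rule, which the text does not state explicitly.

```r
# Quartiles of electric_range per type, taken from the grouped summary table.
quartiles <- data.frame(
  electric_type = c("battery", "hybrid"),
  p25 = c(0, 21),
  p75 = c(73, 38)
)

# Assumed rule: points beyond 1.5 * IQR from the quartiles count as outliers.
iqr <- quartiles$p75 - quartiles$p25
quartiles$lower_fence <- quartiles$p25 - 1.5 * iqr
quartiles$upper_fence <- quartiles$p75 + 1.5 * iqr
print(quartiles)
```

Under that assumption the upper fence is 182.5 miles for the battery type and 63.5 miles for the hybrid type, so any range above those values is drawn as an outlier.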
In the distribution plot of electric range for the battery type, there is an extremely high count at 0. We have to remember, however, that electric range was no longer researched for new BEVs because new cars had an electric range of 30 miles or more; in such cases, 0 was recorded for electric range.
All zero values are associated with the battery electric vehicle type.
| electric_type | Count_0_range |
|---|---|
| battery | 139761 |
Distribution of electric range across the battery electric type.
| electric_range | n | pct_total_range | pct_battery_range |
|---|---|---|---|
| 0 | 139761 | 59.30% | 74.7% |
| 215 | 6403 | 2.70% | 3.4% |
| 238 | 4262 | 1.80% | 2.3% |
| 220 | 4057 | 1.70% | 2.2% |
| 84 | 3699 | 1.60% | 2.0% |
| 291 | 2365 | 1.00% | 1.3% |
| 208 | 2318 | 1.00% | 1.2% |
| 210 | 1836 | 0.80% | 1.0% |
| 75 | 1773 | 0.80% | 0.9% |
| 322 | 1719 | 0.70% | 0.9% |
There are 139 761 vehicles with 0 range, which is 59% of the whole dataset and 74.7% of the battery vehicles. This considerable amount of zero-range data distorts the statistics for this variable and makes its distribution strongly right-skewed. By comparison, the percentage of vehicles with an electric range above 190 is extremely low, so these values appear as outliers in the boxplots.
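The two percentages use different denominators: the whole dataset and the battery vehicles only. A small base R sketch on toy data (made-up values, not the real dataset) shows how such a frequency table can be built.

```r
# Toy data: 8 battery vehicles (6 with range 0) and 2 hybrids.
ev <- data.frame(
  electric_type = c(rep("battery", 8), rep("hybrid", 2)),
  electric_range = c(0, 0, 0, 0, 0, 0, 215, 238, 32, 38)
)

battery <- ev[ev$electric_type == "battery", ]

# Frequency of each range value among battery vehicles.
counts <- table(battery$electric_range)
range_counts <- data.frame(
  electric_range = as.numeric(names(counts)),
  n = as.integer(counts)
)

# Share relative to the whole dataset and to battery vehicles only.
range_counts$pct_total_range <- 100 * range_counts$n / nrow(ev)
range_counts$pct_battery_range <- 100 * range_counts$n / nrow(battery)

# Most frequent range values first.
range_counts <- range_counts[order(-range_counts$n), ]
print(range_counts)
```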
Now I explore the Clean Alternative Fuel Vehicle (CAFV) Eligibility, which I call simply eligibility for short. There are three classes:
## [1] "clean" "unknown" "negative"
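Where clean stands for Clean Alternative Fuel Vehicle Eligible, unknown for Eligibility unknown as battery range has not been researched, and negative for Not eligible due to low battery range.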
With respect to the length of the dataset, the percentage of each class in eligibility is shown in the table.
| eligibility | Number.Of.Eligibility | Percentage |
|---|---|---|
| clean | 73317 | 31% |
| negative | 22614 | 10% |
| unknown | 139761 | 59% |
From this table, the unknown eligibility class has the largest number of entries.
Is there a relation between the zeros in electric range and the unknown category?
## [1] "number of clean eligibility with range 0 : 0"
## [1] "number of eligibility unknown with range 0 : 139761"
## [1] "number of negative with range 0 : 0"
All vehicles (139 761) that were registered with 0 range were classified with unknown eligibility and battery type.
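A check of this kind can be sketched in base R by counting zero-range rows per eligibility class and per electric type. The data below are toy values, so only the pattern, not the counts, mirrors the result above.

```r
# Toy data (made-up values) with the short class labels.
ev <- data.frame(
  electric_type = c("battery", "battery", "battery", "battery", "hybrid", "hybrid"),
  eligibility = c("unknown", "unknown", "clean", "negative", "clean", "negative"),
  electric_range = c(0, 0, 215, 24, 38, 21)
)

is_zero <- ev$electric_range == 0

# Number of zero-range vehicles in each eligibility class.
zero_by_eligibility <- tapply(is_zero, ev$eligibility, sum, na.rm = TRUE)
print(zero_by_eligibility)

# Number of zero-range vehicles in each electric type.
zero_by_type <- tapply(is_zero, ev$electric_type, sum, na.rm = TRUE)
print(zero_by_type)
```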
| electric_type | Median.Clean | Mean.Clean |
|---|---|---|
| battery | 215 | 199 |
| hybrid | 38 | 41 |
The mean and median of the negative eligibility class are below 41 for both electric vehicle types. The unknown eligibility class is registered only for the battery type; there are no unknown registrations for the plug-in hybrid type.
## model_year
## Min. :2000
## 1st Qu.:2016
## Median :2018
## Mean :2018
## 3rd Qu.:2019
## Max. :2024
The median of model_year for the clean eligibility class is 2018.
Which model years are associated with these vehicles with 0 range and unknown eligibility?
The plots above show that, for the battery electric type, there are 0 range values for model year 2008. The rest of the vehicles registered with 0 range correspond to model years between 2019 and 2025. The model years 2022, 2023 and 2025 have only 0 range values.
Is there a relation between the number of 0 range values and the make?
It seems that several makes have 0 range values.
Summary of the exploration of the zero mode:
The zero electric range values correspond only to the class Eligibility unknown as battery range has not been researched and only to the battery electric vehicle type. Most of these values were registered between 2019 and 2025.
For nonzero range and the battery type, there are two associated eligibility classes: clean and negative.
The clean class has a median electric range of 215 and a median model year of 2018.
The negative class has range values below 41.
## city eligibility model_year make model
## 3 0 0 0 0
## electric_range electric_type
## 36 0
The electric range feature also has missing values.
## [1] "percentage_miss_range: 0"
The percentage of missing values in electric_range is 0.015%, which is too small to do any harm. For a better understanding of the missing values, however, I explore the columns to see where they occur.
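The per-column count and the percentage can be reproduced with a few lines of base R. The data frame is a toy example, while the two counts in the second part are the ones reported in this section.

```r
# Toy data frame (made-up values) with one missing value per column.
ev <- data.frame(
  city = c("Seattle", NA, "Olalla", "Rainier"),
  electric_range = c(220, 0, NA, 42)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(ev))
print(missing_per_column)

# Share of missing electric_range values, using the reported counts.
n_missing <- 36
n_rows <- 235692
pct_missing <- 100 * n_missing / n_rows
print(round(pct_missing, 3))
```

Rounded to a whole percent the share is 0, which is consistent with the printed value above; three decimals are needed to see the 0.015%.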
## # A tibble: 1 × 2
## model_year electric_type
## <dbl> <chr>
## 1 2025 hybrid
The 36 missing values correspond to model_year 2025 and to the electric_type Plug-in Hybrid Electric Vehicle (PHEV).
It seems that the outliers in the model year feature are due to the fact that electric vehicles were not popular during the initial years. Therefore, for my analysis I consider model years from 2011 to 2025.
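Restricting the data to these model years is a plain row filter. A minimal base R sketch on toy values, with `ev_recent` as an illustrative name:

```r
# Toy data (made-up values) spanning early and recent model years.
ev <- data.frame(
  model_year = c(2000, 2008, 2010, 2011, 2019, 2025),
  electric_range = c(58, 220, 245, 73, 0, 0)
)

# Keep only model years from 2011 to 2025, both inclusive.
ev_recent <- ev[ev$model_year >= 2011 & ev$model_year <= 2025, ]
print(ev_recent)
```

According to the model year counts listed earlier, this filter removes 55 vehicles and keeps 235637.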
Next, I replace the zero range values of the battery electric type with 215, the median electric range registered in the clean eligibility class.
## # A tibble: 1 × 1
## Observations_0_range
## <int>
## 1 0
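The replacement and the zero-count check can be sketched in base R as follows. The toy values are chosen so that the clean battery median is also 215; the variable names are illustrative and missing ranges are left untouched.

```r
# Toy data (made-up values): two zero-range battery vehicles.
ev <- data.frame(
  electric_type = c("battery", "battery", "battery", "battery", "battery", "hybrid", "hybrid"),
  eligibility = c("unknown", "unknown", "clean", "clean", "clean", "clean", "negative"),
  electric_range = c(0, 0, 208, 215, 238, 38, 21)
)

is_battery <- ev$electric_type == "battery"

# Median range of battery vehicles in the clean class.
clean_median <- median(
  ev$electric_range[is_battery & ev$eligibility == "clean"],
  na.rm = TRUE
)

# Replace zero ranges of battery vehicles; NA values stay NA.
zero_battery <- is_battery & !is.na(ev$electric_range) & ev$electric_range == 0
ev$electric_range[zero_battery] <- clean_median

# Check that no zero range remains.
remaining_zero <- sum(ev$electric_range == 0, na.rm = TRUE)
print(remaining_zero)
```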
| electric_type | Electric.Range |
|---|---|
| battery | 215 |
| hybrid | 32 |
Although I have replaced the 0 values with the median range of the clean class, this has not solved the problem of outliers. Replacing the 0 values with the median moves the mode to the right. New outliers are still present because the replaced values make up more than half of the total data. There is no correlation between the features, except for the slight correlation between model year and eligibility, so the outliers do not affect the results of the analysis of the other features.
| electric_type | n | perc | labels |
|---|---|---|---|
| hybrid | 48692 | 0.2066399 | 21% |
| battery | 186945 | 0.7933601 | 79% |
## [1] "TESLA" "HYUNDAI" "BMW"
## [4] "TOYOTA" "NISSAN" "KIA"
## [7] "POLESTAR" "MAZDA" "CHEVROLET"
## [10] "VOLVO" "JEEP" "FIAT"
## [13] "LINCOLN" "AUDI" "DODGE"
## [16] "RIVIAN" "VOLKSWAGEN" "FORD"
## [19] "HONDA" "PORSCHE" "MITSUBISHI"
## [22] "LEXUS" "JAGUAR" "SMART"
## [25] "CHRYSLER" "MERCEDES-BENZ" "GMC"
## [28] "MINI" "SUBARU" "CADILLAC"
## [31] "ACURA" "LAND ROVER" "GENESIS"
## [34] "LUCID" "ALFA ROMEO" "FISKER"
## [37] "VINFAST" "BENTLEY" "MULLEN AUTOMOTIVE INC."
## [40] "BRIGHTDROP" "TH!NK" "LAMBORGHINI"
## [43] "AZURE DYNAMICS" "ROLLS-ROYCE" "RAM"
There are 45 manufacturers (makes).

### Makes across electric type
| make | Observations.Make.Battery |
|---|---|
| TESLA | 101037 |
| NISSAN | 15532 |
| CHEVROLET | 12426 |
| FORD | 8874 |
| KIA | 8074 |
| RIVIAN | 6750 |
| HYUNDAI | 6331 |
| VOLKSWAGEN | 5976 |
| BMW | 3850 |
| AUDI | 2462 |
| make | Observations.Make.Plugin |
|---|---|
| TOYOTA | 8077 |
| JEEP | 5951 |
| BMW | 5797 |
| CHEVROLET | 4709 |
| VOLVO | 3929 |
| CHRYSLER | 3786 |
| FORD | 3724 |
| KIA | 3271 |
| AUDI | 1898 |
| HYUNDAI | 1075 |
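Ranked tables like the two above can be produced by counting makes within one electric type and keeping the largest counts. The helper `top_makes` below is a hypothetical name, and the data are toy values.

```r
# Hypothetical helper: the n most frequent makes for one electric type.
top_makes <- function(data, type, n = 10) {
  counts <- sort(table(data$make[data$electric_type == type]), decreasing = TRUE)
  out <- data.frame(
    make = names(counts),
    observations = as.integer(counts),
    stringsAsFactors = FALSE
  )
  head(out, n)
}

# Toy data (made-up counts).
ev <- data.frame(
  electric_type = c("battery", "battery", "battery", "battery", "hybrid", "hybrid", "hybrid"),
  make = c("TESLA", "TESLA", "NISSAN", "TESLA", "TOYOTA", "JEEP", "TOYOTA"),
  stringsAsFactors = FALSE
)

top_battery <- top_makes(ev, "battery")
top_hybrid <- top_makes(ev, "hybrid", n = 1)
print(top_battery)
print(top_hybrid)
```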
## # A tibble: 48 × 3
## # Groups: electric_type [2]
## electric_type city n
## <chr> <chr> <int>
## 1 battery Auburn 2224
## 2 battery Bainbridge Island 1611
## 3 battery Bellevue 9963
## 4 battery Bellingham 2887
## 5 battery Bonney Lake 1155
## 6 battery Bothell 6752
## 7 battery Bremerton 1412
## 8 battery Burien 1038
## 9 battery Camas 1635
## 10 battery Edmonds 1939
## # ℹ 38 more rows
Seattle is the city with the most registered vehicles of both the battery and the plug-in type. Across Washington cities, more battery vehicles than plug-in vehicles are registered.
Most of the electric vehicles registered between 2011 and 2025 are of the battery type, with nearly 80% of the total number of registered vehicles. Among electric vehicles in this class, the most frequently registered are:
For the plug-in electric vehicle type, the most frequent entries are: